Abstract: We consider an agent interacting with an unmodeled
environment. At each time, the agent makes an observation, takes an 
action, and incurs a cost. Its actions can influence future observations 
and costs. The goal is to minimize the long-term average cost. We 
propose a novel algorithm, known as the active LZ algorithm, for 
optimal control based on ideas from the Lempel-Ziv scheme for 
universal data compression and prediction. We establish that, under 
the active LZ algorithm, if there exists an integer K such that the 
future is conditionally independent of the past given a window of K 
consecutive actions and observations, then the average cost converges 
to the optimum. Experimental results involving the game of Rock- 
Paper-Scissors illustrate merits of the algorithm. 

Index Terms: Lempel-Ziv, context tree, optimal control, reinforcement
learning, dynamic programming, value iteration.



I. Introduction 

CONSIDER an agent that, at each integer time $t$, makes
an observation $X_t$ from a finite observation space $\mathcal{X}$, and
takes an action $A_t$ selected from a finite action space $\mathcal{A}$. The
agent incurs a bounded cost $g(X_t, A_t, X_{t+1}) \in [-g_{\max}, g_{\max}]$.
The goal is to minimize the long-term average cost

$$\limsup_{T \to \infty} \, \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} g(X_t, A_t, X_{t+1}) \right].$$

Here, the expectation is over the randomness in the $X_t$
process¹, and, at each time $t$, the action $A_t$ is selected as
a function of the prior observations $X^t$ and the prior actions
$A^{t-1}$.

We will propose a general action-selection strategy called 
the active LZ algorithm. In addition to the new strategy, a 
primary contribution of this paper is a theoretical guarantee 
that this strategy attains optimal average cost under weak 
assumptions about the environment. The main assumption 
is that there exists an integer K such that the future is 
conditionally independent of the past given a window of K 
consecutive actions and observations. In other words,

$$\Pr(X_t = x_t \mid \mathcal{F}_{t-1}) = P(x_t \mid X_{t-K}^{t-1}, A_{t-K}^{t-1}), \tag{1}$$



Manuscript received July 20, 2007; revised June 8, 2009. The first author
was supported by a supplement to NSF Grant ECS-9985229 provided by the
MKIDS Program. The second author was supported by a Benchmark Stanford
Graduate Fellowship.

V. F. Farias is with the Sloan School of Management, Massachusetts Institute
of Technology, Cambridge, MA 02139 USA (e-mail: vivekf@mit.edu).

C. C. Moallemi is with the Graduate School of Business, Columbia
University, New York, NY 10027 USA (e-mail: ciamac@gsb.columbia.edu).

B. Van Roy is with the Departments of Management Science & Engineering
and Electrical Engineering, Stanford University, Stanford, CA 94305 USA
(e-mail: bvr@stanford.edu).

T. Weissman is with the Department of Electrical Engineering, Stanford
University, Stanford, CA 94305 USA (e-mail: tsachy@stanford.edu).

¹For a sequence such as $\{X_t\}$, $X_s^t$ denotes the vector $(X_s, \ldots, X_t)$. We
also use the notation $X^t = X_1^t$.



where $P$ is a transition kernel and $\mathcal{F}_t$ is the $\sigma$-algebra
generated by $(X^t, A^t)$. We are particularly interested in
situations where neither $P$ nor even $K$ is known to the agent,
that is, where there is a finite but unknown dependence on
history.

Consider the following examples, which fall into the above 
formalism. 

Example 1 (Rock-Paper-Scissors). Rock-Paper-Scissors is a 
two-person, zero-sum matrix game that has a rich history as a 
reinforcement learning problem. The two players play a series 
of games indexed by the integer t. Each player must generate 
an action (rock, paper, or scissors) for each game. He then
observes his opponent's hand and incurs a cost of $-1$, $1$, or
$0$, depending on whether the pair of hands results in a win,
loss, or draw. The game is played repeatedly and the player's
objective is to minimize the average cost.

Define $X_t$ to be the opponent's choice of action in game $t$,
and $A_{t-1}$ to be the player's choice of action in game $t$. The
action and observation spaces for this game are

$$\mathcal{A} = \mathcal{X} = \{\text{rock}, \text{paper}, \text{scissors}\}.$$

Identifying these with the integers $\{1, 2, 3\}$, the cost function
is

$$g(x_t, a_t, x_{t+1}) = \begin{bmatrix} 0 & 1 & -1 \\ -1 & 0 & 1 \\ 1 & -1 & 0 \end{bmatrix}_{a_t,\, x_{t+1}},$$

that is, the entry in row $a_t$ and column $x_{t+1}$ of the matrix.



Assuming that the opponent uses a mixed strategy that depends
only on information from the last $K - 1$ games, such a strategy
defines a transition kernel $P$ over the opponent's play $X_t$ in
game $t$ of the form (1). (Note that such a $P$ has special
structure in that, for example, it has no dependence on the
player's action $A_{t-1}$ in game $t$, since this is unknown to the
opponent until after game $t$ is played.) Then, the problem
of finding the optimal strategy against an unknown, finite-memory
opponent falls within our framework.
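
As a small illustration of the cost function in Example 1, the following Python sketch encodes the matrix as a lookup table. The integer encoding (0, 1, 2 rather than 1, 2, 3) and the names `COST` and `g` are assumptions made for this sketch only.

```python
# Hypothetical encoding for this sketch: 0 = rock, 1 = paper, 2 = scissors.
ROCK, PAPER, SCISSORS = 0, 1, 2

# COST[a][x] is the player's cost when the player plays a and the
# opponent plays x: -1 for a win, 1 for a loss, 0 for a draw.
COST = [
    [0, 1, -1],   # player plays rock
    [-1, 0, 1],   # player plays paper
    [1, -1, 0],   # player plays scissors
]

def g(x_t, a_t, x_next):
    """Cost g(x_t, a_t, x_{t+1}); in this game it does not depend on x_t."""
    return COST[a_t][x_next]
```

Each row sums to zero, so against a uniformly random opponent every action has expected cost zero.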

Example 2 (Joint Source-Channel Coding with a Fixed
Decoder). Let $\mathbb{S}$ and $\mathbb{Y}$ be finite source and channel alphabets,
respectively. Consider a sequence of symbols $\{S_t\}$ from the
source alphabet $\mathbb{S}$ which are to be encoded for transmission
across a channel. Let $Y_t \in \mathbb{Y}$ represent the choice of
encoding at time $t$, and let $\hat{Y}_t \in \mathbb{Y}$ be the symbol received
after corruption by the channel. We will assume that this
channel has a finite memory of order $M$. In other words, the
distribution of $\hat{Y}_t$ is a function of $Y_{t-M+1}^t$. For all times $t$, let
$d : \mathbb{Y}^L \to \mathbb{S}$ be some fixed decoder that decodes the symbol at
time $t$ based on the most recent $L$ symbols received, $\hat{Y}_{t-L+1}^t$.
Given a single letter distortion measure $\rho : \mathbb{S} \times \mathbb{S} \to \mathbb{R}$, define
the expected distortion at time $t$ by

$$g(s_t, y_{t-L-M+2}^t) = \mathbb{E}\left[ \rho\big(d(\hat{Y}_{t-L+1}^t), s_t\big) \,\middle|\, Y_{t-L-M+2}^t = y_{t-L-M+2}^t \right].$$

The optimization problem is to find a sequence of functions
$\{\mu_t\}$, where each function $\mu_t : \mathcal{X}^t \to \mathcal{A}$ specifies an encoder
at time $t$, so as to minimize the long-term average distortion

$$\limsup_{T \to \infty} \, \mathbb{E}\left[ \frac{1}{T} \sum_{t=1}^{T} g(S_t, Y_{t-L-M+2}^t) \right].$$

Assume that the source is Markov of order $N$, but that both
the transition probabilities for the source and the order $N$
are unknown. Setting $K = \max(L + M - 1, N)$, define the
observation at time $t$ to be the vector $X_t = (\ldots)$
and the action at time $t$ to be $A_t = Y_t$. Then, the optimal
coding problem at hand falls within our framework (cf. [1]
and references therein).

With knowledge of the kernel $P$ (or even just the order of
the kernel, $K$), solving for the average cost optimal policy
in either of the examples above via dynamic programming
methods is relatively straightforward. This paper develops an
algorithm that, without knowledge of the kernel or its order,
achieves average cost optimality. The active LZ algorithm we
develop consists of two broad components. The first is an
efficient data structure, a context tree on the joint process
$(X_t, A_{t-1})$, to store information relevant to predicting the
observation at time $t + 1$, $X_{t+1}$, given the history available
up to time $t$ and the action selected at time $t$, $A_t$. Our
prediction methodology borrows heavily from the Lempel-Ziv
algorithm for data compression [2]. The second component of
our algorithm is a dynamic programming scheme that, given
the probabilistic model determined by the context tree, selects
actions so as to minimize costs over a suitably long horizon.
Absent knowledge of the order of the kernel, $K$, the two
tasks above (building a context tree in order to estimate the
kernel, and selecting actions that minimize long-term costs)
must be done continually in tandem, which creates an important
tension between 'exploration' and 'exploitation'. In particular,
on the one hand, the algorithm must select actions in a manner
that builds an accurate context tree. On the other hand, the
desire to minimize costs naturally restricts this selection. By
carefully balancing these two needs, our algorithm achieves
an average cost equal to that of an optimal policy with full
knowledge of the kernel $P$.

Related problems have been considered in the literature.
Kearns and Singh [3] present an algorithm for reinforcement
learning in a Markov decision process. This algorithm can
be applied in our context when $K$ is known, and asymptotic
optimality is guaranteed. More recently, Even-Dar et al. [4]
present an algorithm for optimal control of partially observable
Markov decision processes, a more general setting than what
we consider here, and are able to establish theoretical bounds
on convergence time. The algorithm there, however, seems
difficult and unrealistic to implement in contrast with what
we present here. Further, it relies on knowledge of a constant
related to the amount of time a 'homing' policy requires
to achieve equilibrium. This constant may be challenging to
estimate.

Work by de Farias and Megiddo [5] considers an optimal 
control framework where the dynamics of the environment 
are not known and one wishes to select the best of a finite 
set of 'experts'. In contrast, our problem can be thought 
of as competing with the set of all possible strategies. The 
prediction problem for loss functions with memory and a 
Markov-modulated source considered by Merhav et al. [6] is
essentially a Markov decision problem as the authors point 
out; again, in this case, knowing the structure of the loss 
function implicitly gives the order of the underlying Markov 
process. 

The active LZ algorithm is inspired by the Lempel-Ziv 
algorithm. This algorithm has been extended to address many 
problems, such as prediction [7], [8] and filtering [6]. In almost
all cases, however, future observations are not influenced by 
actions taken by the algorithm. This is in contrast to the 
active LZ algorithm, which proactively anticipates the effect 
of actions on future observations. An exception is the work of 
Vitter and Krishnan [9], which considers cache pre-fetching
and can be viewed as a special case of our formulation. 

The Lempel-Ziv algorithm and its extensions revolve around 
a context tree data structure that is constructed as observations 
are made. This data structure is simple and elegant from 
an implementational point of view. The use of this data 
structure in reinforcement learning represents a departure from 
representations of state and belief state commonly used in 
the reinforcement learning literature. Such data structures 
have proved useful in representing experience in algorithms 
for engineering applications ranging from compression to 
prediction to denoising. Understanding whether and how some 
of this value can be extended to reinforcement learning is the 
motivation for this paper. 

The remainder of this paper is organized as follows. In
Section II, we formulate our problem precisely. In Section III,
we present our algorithm and provide computational results
in the context of the Rock-Paper-Scissors example. Our main
result, as stated in Theorem 2 in Section IV, is that the
algorithm is asymptotically optimal. Section V concludes.

II. Problem Formulation 

Recall that we are endowed with finite action and observation
spaces $\mathcal{A}$ and $\mathcal{X}$, respectively, and we have

$$\Pr(X_t = x_t \mid \mathcal{F}_{t-1}) = P(x_t \mid X_{t-K}^{t-1}, A_{t-K}^{t-1}),$$

where $P$ is a stochastic transition kernel. A policy $\mu$ is a
sequence of mappings $\{\mu_t\}$, where for each time $t$ the map
$\mu_t : \mathcal{X}^t \times \mathcal{A}^{t-1} \to \mathcal{A}$ determines which action shall be
chosen at time $t$ given the history of observations and actions
observed up to time $t$. In other words, under policy $\mu$, actions
will evolve according to the rule

$$A_t = \mu_t(X^t, A^{t-1}).$$

We will call a policy $\mu$ stationary if

$$\mu_t(X^t, A^{t-1}) = \mu(X_{t-K+1}^t, A_{t-K+1}^{t-1}), \quad \text{for all } t \ge K,$$

for some function $\mu : \mathcal{X}^K \times \mathcal{A}^{K-1} \to \mathcal{A}$. Such a policy
selects actions in a manner that depends only on the current
observation $X_t$ and the observations and actions over the most
recent $K$ time steps. It is clear that for a fixed stationary
policy $\mu$, the observations and actions for time $t \ge K$ evolve
according to a Markov chain on the finite state space
$\mathcal{X}^K \times \mathcal{A}^{K-1}$. Given an initial state $(x^K, a^{K-1})$, we can define the
average cost associated with the stationary policy $\mu$ by

$$\lambda_\mu(x^K, a^{K-1}) = \lim_{T \to \infty} \mathbb{E}_\mu\left[ \frac{1}{T} \sum_{t=1}^{T} g(X_t, A_t, X_{t+1}) \right].$$

Here, the expectation is conditioned on the initial state
$(X^K, A^{K-1}) = (x^K, a^{K-1})$. Since the underlying state
space, $\mathcal{X}^K \times \mathcal{A}^{K-1}$, is finite, the above limit always exists
[10, Proposition 4.1.2]. Since there are finitely many stationary
policies, we can define the optimal average cost over stationary
policies by

$$\lambda^*(x^K, a^{K-1}) = \min_{\mu} \lambda_\mu(x^K, a^{K-1}),$$

where the minimum is taken over the set of all stationary
policies. Again, because of the finiteness of the underlying
state space, $\lambda^*$ is also the optimal average cost that can be
achieved using any policy, stationary or not. In other words,

$$\lambda^*(x^K, a^{K-1}) = \inf_{\mu} \limsup_{T \to \infty} \mathbb{E}_\mu\left[ \frac{1}{T} \sum_{t=1}^{T} g(X_t, A_t, X_{t+1}) \right], \tag{2}$$

where the infimum is taken over the set of all policies [10,
Proposition 4.1.7].

We next make an assumption that will enable us to 
streamline our analysis in subsequent sections. 

Assumption 1. The optimal average cost is independent of
the initial state. That is, there exists a constant $\lambda^*$ so that

$$\lambda^*(x^K, a^{K-1}) = \lambda^*, \quad \forall\, (x^K, a^{K-1}) \in \mathcal{X}^K \times \mathcal{A}^{K-1}.$$

The above assumption is benign and is satisfied for any
strictly positive kernel $P$, for example. More generally, such an
assumption holds for the class of problems satisfying a 'weak
accessibility' condition (see Bertsekas [10] for a discussion
of the structural properties of average cost Markov decision
problems). In the context of our problem, it is difficult to
design controllers that achieve optimal average cost in the
absence of such an assumption. In particular, if there exist
policies under which the chain has multiple recurrent classes,
then the optimal average cost may well depend on the initial
state and actions taken very early on might play a critical role
in achieving this performance. We note that in such cases the
assumption above (and our subsequent analysis) is valid for
the recurrent class our controller eventually enters.

If the transition kernel $P$ (and, thereby, $K$) were known,
dynamic programming would be a means of finding a stationary policy
that achieves average cost $\lambda^*$. One approach would be to find
a solution $J_\alpha^* : \mathcal{X}^K \times \mathcal{A}^{K-1} \to \mathbb{R}$ to the discounted Bellman
equation

$$J_\alpha^*(x^K, a^{K-1}) = \min_{a_K} \sum_{x_{K+1}} P(x_{K+1} \mid x^K, a^K) \left[ g(x_K, a_K, x_{K+1}) + \alpha J_\alpha^*(x_2^{K+1}, a_2^K) \right], \tag{3}$$

for all $(x^K, a^{K-1}) \in \mathcal{X}^K \times \mathcal{A}^{K-1}$. Here, $\alpha \in (0, 1)$ is a
discount factor. If the discount factor $\alpha$ is chosen to be
sufficiently close to 1, a solution $J_\alpha^*$ (known as the cost-to-go
function) to the Bellman equation can be used to define an
optimal stationary policy for the original, average-cost problem
[10]. In particular, for all $(x^K, a^{K-1}) \in \mathcal{X}^K \times \mathcal{A}^{K-1}$, define
the set $\mathcal{A}_\alpha^*(x^K, a^{K-1})$ of $\alpha$-discounted optimal actions to be
the set of minimizers of the optimization program

$$\min_{a_K} \sum_{x_{K+1}} P(x_{K+1} \mid x^K, a^K) \left[ g(x_K, a_K, x_{K+1}) + \alpha J_\alpha^*(x_2^{K+1}, a_2^K) \right]. \tag{4}$$

At a given time $t$, $\mathcal{A}_\alpha^*(X_{t-K+1}^t, A_{t-K+1}^{t-1})$ is the set of actions
obtained by acting greedily with respect to $J_\alpha^*$. These actions
seek to minimize the expected value of the immediate cost
$g(X_t, A_t, X_{t+1})$ at the current time, plus a continuation cost,
which quantifies the impact of the current decision on all
future costs, and is captured by $J_\alpha^*$.
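
The discounted Bellman equation and the greedy action set can be illustrated with a short value iteration sketch in Python. Everything specific here is an assumption for illustration: the opponent is a hypothetical one who repeats rock after beating scissors and otherwise plays uniformly at random, the state is reduced to the previous game's pair (opponent's play, player's play) because that is all this opponent reacts to, and the names `kernel`, `q_value`, and `value_iteration` are invented for the sketch.

```python
from itertools import product

ROCK, PAPER, SCISSORS = 0, 1, 2
MOVES = (ROCK, PAPER, SCISSORS)
COST = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]  # COST[player][opponent]

def kernel(x_next, x_prev, a_prev):
    """P(opponent's next play | previous game) for the assumed opponent."""
    if (x_prev, a_prev) == (ROCK, SCISSORS):
        return 1.0 if x_next == ROCK else 0.0
    return 1.0 / 3.0

def q_value(J, state, a, alpha):
    """Expected immediate cost of action a plus discounted continuation cost."""
    x_prev, a_prev = state
    return sum(kernel(x, x_prev, a_prev) * (COST[a][x] + alpha * J[(x, a)])
               for x in MOVES)

def value_iteration(alpha=0.99, sweeps=5000):
    """Iterate the discounted Bellman operator, then act greedily on the result."""
    states = list(product(MOVES, MOVES))  # (opponent's last play, player's last play)
    J = {s: 0.0 for s in states}
    for _ in range(sweeps):
        J = {s: min(q_value(J, s, a, alpha) for a in MOVES) for s in states}
    policy = {s: min(MOVES, key=lambda a: q_value(J, s, a, alpha)) for s in states}
    return J, policy

J, policy = value_iteration()
```

With the discount factor close to 1, the greedy policy plays paper right after the pair (rock, scissors) and scissors otherwise, and $(1 - \alpha)$ times the cost-to-go is close to $-0.25$.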

If $\alpha$ is sufficiently close to 1, and $\mu^*$ is a policy such that
for $t \ge K$,

$$\mu_t^*(X^t, A^{t-1}) \in \mathcal{A}_\alpha^*(X_{t-K+1}^t, A_{t-K+1}^{t-1}), \tag{5}$$

then $\mu^*$ will achieve the optimal average cost $\lambda^*$. Such a
policy $\mu^*$ is sometimes called a Blackwell optimal policy
[10].

We return to our example of the game of Rock-Paper-Scissors
to illustrate the above approach.

Example 1 (Rock-Paper-Scissors). Given knowledge of the
opponent's (finite-memory) strategy and, thus, the transition
kernel $P$, the Bellman equation (3) can be solved for the
optimal cost-to-go function $J_\alpha^*$. Then, an optimal policy for
the player would be, for each game $t + 1$, to select an
action $A_t$ according to (4)-(5). This action is a function
of the entire history of game play only through the sequence
$(X_{t-K+1}^t, A_{t-K+1}^{t-1})$ of recent game play. The action
is selected by optimally accounting for both the expected
immediate cost $g(X_t, A_t, X_{t+1})$ of the game at hand, and
the impact of the choice of action towards all future games
(through the cost-to-go function $J_\alpha^*$).

III. A Universal Scheme 

Direct solution of the Bellman equation (3) requires knowledge
of the transition kernel $P$. Algorithm 1, the active LZ
algorithm, is a method that requires no knowledge of $P$, or
even of $K$. Instead, it simultaneously estimates a probabilistic
model for the evolution of the system and develops an optimal
control for that model, along the course of a single system
trajectory. At a high level, the two critical components of
the active LZ algorithm are the estimates $\hat{P}$ and $\hat{J}$. $\hat{P}$ is our
estimate of the true kernel $P$. This estimate is computed using
variable length contexts to dynamically build higher order
models of the underlying process, in a manner reminiscent
of the Lempel-Ziv scheme used for universal prediction. $\hat{J}$
is the estimate of the optimal cost-to-go function $J_\alpha^*$ that is
the solution to the Bellman equation (3). It is computed in a
fashion similar to the value iteration approach to solving the
Bellman equation (see [10]). Given the estimates
$\hat{P}$ and $\hat{J}$, the algorithm randomizes to strike a balance between
selecting actions so as to improve the quality of the estimates
(exploration) and acting greedily with respect to the estimates
so as to minimize the costs incurred (exploitation).

The active LZ algorithm takes as inputs a discount factor
$\alpha \in (0, 1)$, sufficiently close to 1, and a sequence of
exploration probabilities $\{\gamma_t\}$. The algorithm proceeds as
follows: time is parsed into intervals, or 'phrases', with the
property that if the $c$th phrase covers the time interval
$\tau_c \le t \le \tau_{c+1} - 1$, then the observation/action sequence
$(X_{\tau_c}^{\tau_{c+1}-1}, A_{\tau_c}^{\tau_{c+1}-2})$ will not have occurred as the prefix of
any other phrase before time $\tau_c$.

At any point in time $t$, if the current phrase started at
time $\tau_c$, the sequence $(X_{\tau_c}^t, A_{\tau_c}^{t-1})$ defines a context which is
used to estimate transition probabilities and cost-to-go function
values. To be precise, given a sequence of observations and
actions $(x^\ell, a^{\ell-1})$, we say the context at time $t$ is $(x^\ell, a^{\ell-1})$
if $(X_{\tau_c}^t, A_{\tau_c}^{t-1}) = (x^\ell, a^{\ell-1})$. For each $x_{\ell+1} \in \mathcal{X}$ and $a_\ell \in \mathcal{A}$,
the algorithm maintains an estimate $\hat{P}(x_{\ell+1} \mid x^\ell, a^\ell)$ of the
probability of observing $X_{t+1} = x_{\ell+1}$ at the next time
step, given the choice of action $A_t = a_\ell$ and the current
context $(X_{\tau_c}^t, A_{\tau_c}^{t-1}) = (x^\ell, a^{\ell-1})$. This transition probability
is initialized to be uniform, and subsequently updated using an
empirical estimator based on counts for various realizations of
$X_{t+1}$ at prior visits to the context in question. If $N(x^{\ell+1}, a^\ell)$
is the number of times the context $(x^{\ell+1}, a^\ell)$ has been visited
prior to time $t$, then the estimate

$$\hat{P}(x_{\ell+1} \mid x^\ell, a^\ell) = \frac{N(x^{\ell+1}, a^\ell) + 1/2}{\sum_{x'} N\big((x^\ell, x'), a^\ell\big) + |\mathcal{X}|/2} \tag{6}$$

is used. This empirical estimator is akin to the update of
a Dirichlet-1/2 prior with a multinomial likelihood and is
similar to that considered by Krichevsky and Trofimov [11].
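
The add-1/2 estimator can be sketched in a few lines of Python. The function name `kt_estimate` and the convention that `counts[i]` holds the visit count of the child context ending in the $i$-th observation are assumptions of this sketch.

```python
def kt_estimate(counts):
    """Return (N_i + 1/2) / (sum_j N_j + |X|/2) for each possible next observation.

    counts[i] is the number of prior visits to the child context that ends
    in observation i; len(counts) plays the role of |X|.
    """
    total = sum(counts) + len(counts) / 2.0
    return [(n + 0.5) / total for n in counts]
```

With all counts at zero the estimate is uniform, matching the initialization described above, and no observation is ever assigned probability zero.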

Similarly, at each point in time $t$, given the context
$(X_{\tau_c}^t, A_{\tau_c}^{t-1}) = (x^\ell, a^{\ell-1}) \in \mathcal{X}^\ell \times \mathcal{A}^{\ell-1}$, for each $x_{\ell+1} \in \mathcal{X}$ and
$a_\ell \in \mathcal{A}$, the quantity $\hat{J}(x^{\ell+1}, a^\ell)$ is an estimate of the cost-to-go
if the action $A_t = a_\ell$ is selected and then observation
$X_{t+1} = x_{\ell+1}$ is subsequently realized. This estimate is
initialized to be zero, and subsequently refined by iterating
the dynamic programming operator from (3) backwards over
outcomes that have been previously realized in the system
trajectory, using $\hat{P}$ to estimate the probability of each possible
outcome (line 16).

At each time $t$, an action $A_t$ is selected either with the
intent to explore or to exploit. In the former case, the action is
selected uniformly at random from among all the possibilities
(line 9). This allows the action space to be fully explored and
will prove critical in ensuring the quality of the estimates
$\hat{P}$ and $\hat{J}$. In the latter case, the impact of each possible
action on all future costs is estimated using the transition
probability estimates $\hat{P}$ and the cost-to-go estimates $\hat{J}$, and
the minimizing action is taken, acting greedily with respect
to $\hat{P}$ and $\hat{J}$ (line 10). A sequence $\{\gamma_t\}$ controls the relative
frequency of actions taken to explore versus exploit; over
time, as the system becomes well-understood, actions are
increasingly chosen to exploit rather than explore.
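
The randomized choice between exploring and exploiting can be sketched as follows. The function `select_action` and its arguments are hypothetical: `q_values[a]` stands for the estimated expected cost of action `a` (immediate cost plus discounted cost-to-go under the current estimates), and `gamma_t` is the exploration probability at the current time.

```python
import random

def select_action(q_values, gamma_t, rng=random):
    """Explore uniformly with probability gamma_t; otherwise minimize q_values."""
    actions = list(range(len(q_values)))
    if rng.random() < gamma_t:
        return rng.choice(actions)                   # explore
    return min(actions, key=lambda a: q_values[a])   # exploit greedily
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.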

Note that the active LZ algorithm can be implemented easily
using a tree-like data structure. Nodes at depth $\ell$ correspond to
contexts of the form $(x^\ell, a^{\ell-1})$ that have already been visited.
Each such node can link to at most $|\mathcal{X}||\mathcal{A}|$ child nodes of the
form $(x^{\ell+1}, a^\ell)$ at depth $\ell + 1$. Each node $(x^{\ell+1}, a^\ell)$ maintains
a count $N(x^{\ell+1}, a^\ell)$ of how many times it has been seen as
a context and maintains a cost-to-go estimate $\hat{J}(x^{\ell+1}, a^\ell)$.
The probability estimates $\hat{P}$ need not be separately stored,
since they are readily constructed from the context counts
according to (6). Each phrase interval amounts to traversing a
path from the root to a leaf, and adding an additional leaf. After
each such path is traversed, the algorithm moves backwards
along the path (lines 13-17) and updates only the counts
and cost-to-go estimates along that path. Note that such an
implementation has linear complexity, and requires a bounded
amount of computation and storage per unit time (or symbol).
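
The tree and the phrase parsing described above can be sketched in Python as follows. This is a simplified illustration, not the full algorithm: the names `Node` and `parse_phrases` are invented here, each edge label is treated as one abstract symbol (in the setting above a label would be an observation at the root and an action/observation pair deeper in the tree), and the cost-to-go update is omitted.

```python
class Node:
    """A context-tree node: visit count, cost-to-go estimate, and children."""
    def __init__(self):
        self.count = 0
        self.cost_to_go = 0.0
        self.children = {}

def parse_phrases(symbols):
    """Split symbols into phrases; a phrase ends when it reaches an unseen context."""
    root = Node()
    phrases = []
    node, path, start = root, [], 0
    for i, sym in enumerate(symbols):
        child = node.children.get(sym)
        if child is not None:
            node = child                  # context seen before: keep descending
            path.append(node)
            continue
        leaf = node.children[sym] = Node()  # unseen context: add a new leaf
        path.append(leaf)
        for visited in reversed(path):      # move backward along the path
            visited.count += 1
        phrases.append(tuple(symbols[start:i + 1]))
        node, path, start = root, [], i + 1  # next phrase starts at the root
    return root, phrases

root, phrases = parse_phrases("aababc")
```

Each symbol triggers at most one dictionary lookup on the way down and one count update on the way back, which reflects the bounded work per symbol noted above.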

We will shortly establish that the active LZ algorithm 
achieves the optimal long-term average cost. Before launching 
into our analysis, however, we next consider employing the 
active LZ algorithm in the context of our running example of 
the game of Rock-Paper-Scissors. We have already seen how a 
player in this game can minimize his long-term average cost 
if he knows the opponent's finite-memory strategy. Armed 
with the active LZ algorithm, we can now accomplish the 
same task without knowledge of the opponent's strategy. In 
particular, as long as the opponent plays using some finite- 
memory strategy, the active LZ algorithm will achieve the 
same long-term average cost as an optimal response to this 
strategy. 

Example 1 (Rock-Paper-Scissors). The active LZ algorithm
begins with a simple model of the opponent: it assumes that
the opponent selects actions uniformly at random in every
time step, as per line 4. The algorithm thus does not factor in
play in future time steps in making decisions initially, as per
line 5. As the algorithm proceeds, it refines its estimates of
the opponent's behavior. For game $t + 1$, the current context
$(X_{\tau_c}^t, A_{\tau_c}^{t-1})$ specifies a recent history of the game. Given
this recent history, the algorithm can make a prediction of the
opponent's next play according to $\hat{P}$, and an estimate of
the cost-to-go according to $\hat{J}$. These estimates are refined
as play proceeds and more opponent behavior is observed. If
these estimates converge to their corresponding true values,
the algorithm makes decisions (line 10) that correspond to the
optimal decisions that would be made if the true transition
kernel and cost-to-go function were known, as in (4)-(5).

A. Numerical Experiments with Rock-Paper-Scissors 

Before proceeding with our analysis that establishes the 
average cost optimality of the active LZ algorithm, we 
demonstrate its performance on a simple numerical example 
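of the Rock-Paper-Scissors game. The example will highlight
the importance of making decisions that optimize long-term
costs.

Algorithm 1 The active LZ algorithm, a Lempel-Ziv inspired algorithm for learning.
Input: a discount factor $\alpha \in (0, 1)$ and a sequence of exploration probabilities $\{\gamma_t\}$
 1: $c \leftarrow 1$ {the index of the current phrase}
 2: $\tau_c \leftarrow 1$ {start time of the $c$th phrase}
 3: $N(\cdot) \leftarrow 0$ {initialize context counts}
 4: $\hat{P}(\cdot) \leftarrow 1/|\mathcal{X}|$ {initialize estimated transition probabilities}
 5: $\hat{J}(\cdot) \leftarrow 0$ {initialize estimated cost-to-go values}
 6: for each time $t$ do
 7:   observe $X_t$
 8:   if $N(X_{\tau_c}^t, A_{\tau_c}^{t-1}) > 0$ then {are in a context that we have seen before?}
 9:     with probability $\gamma_t$, pick $A_t$ uniformly over $\mathcal{A}$ {explore independent of history}
10:     with remaining probability, $1 - \gamma_t$, pick $A_t$ greedily according to $\hat{P}$, $\hat{J}$:
          $$A_t \in \operatorname*{argmin}_{a_t} \sum_{x_{t+1}} \hat{P}\big(x_{t+1} \mid X_{\tau_c}^t, (A_{\tau_c}^{t-1}, a_t)\big) \left[ g(X_t, a_t, x_{t+1}) + \alpha \hat{J}\big((X_{\tau_c}^t, x_{t+1}), (A_{\tau_c}^{t-1}, a_t)\big) \right]$$
        {exploit by picking an action greedily}
11:   else {we are in a context not seen before}
12:     pick $A_t$ uniformly over $\mathcal{A}$
13:     for $s$ with $\tau_c \le s \le t$, in decreasing order do {traverse backward through the current context}
14:       update context count: $N(X_{\tau_c}^s, A_{\tau_c}^{s-1}) \leftarrow N(X_{\tau_c}^s, A_{\tau_c}^{s-1}) + 1$
15:       update probability estimates: for all $x_s \in \mathcal{X}$,
            $$\hat{P}(x_s \mid X_{\tau_c}^{s-1}, A_{\tau_c}^{s-1}) \leftarrow \frac{N\big((X_{\tau_c}^{s-1}, x_s), A_{\tau_c}^{s-1}\big) + 1/2}{\sum_{x'} N\big((X_{\tau_c}^{s-1}, x'), A_{\tau_c}^{s-1}\big) + |\mathcal{X}|/2}$$
16:       update cost-to-go estimate:
            $$\hat{J}(X_{\tau_c}^s, A_{\tau_c}^{s-1}) \leftarrow \min_{a_s} \sum_{x_{s+1}} \hat{P}\big(x_{s+1} \mid X_{\tau_c}^s, (A_{\tau_c}^{s-1}, a_s)\big) \left[ g(X_s, a_s, x_{s+1}) + \alpha \hat{J}\big((X_{\tau_c}^s, x_{s+1}), (A_{\tau_c}^{s-1}, a_s)\big) \right]$$
17:     end for
18:     $c \leftarrow c + 1$, $\tau_c \leftarrow t + 1$ {start the next phrase}
19:   end if
20: end for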

Consider a simple opponent that plays as follows. If, in the 
previous game, the opponent played rock against scissors, the 
opponent will play rock again deterministically. Otherwise, 
the opponent will pick a play uniformly at random. It is easy 
to see that an optimal strategy against such an opponent is 
to consistently play scissors until (rock, scissors) occurs, play 
paper for one game, and then repeat. Such a strategy incurs 
an optimal average cost of $-0.25$.
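
The value $-0.25$ can be checked with a short exact computation. The sketch below is only an illustration: it models play under the stated strategy as a two-situation Markov chain (the labels "random" and "forced" and the table `TRANSITIONS` are names invented here) and averages the per-game cost under the chain's stationary distribution.

```python
from fractions import Fraction

THIRD = Fraction(1, 3)

# Two situations under the strategy: the opponent is about to play at random
# ("random", the player shows scissors) or must repeat rock ("forced", the
# player shows paper). Each entry is (probability, player's cost, next situation).
TRANSITIONS = {
    "random": [(THIRD, 1, "forced"),     # scissors loses to rock
               (THIRD, -1, "random"),    # scissors beats paper
               (THIRD, 0, "random")],    # scissors draws with scissors
    "forced": [(Fraction(1), -1, "random")],  # paper beats the repeated rock
}

def average_cost(transitions):
    """Long-run average cost of the two-situation chain, computed exactly."""
    p = sum(pr for pr, _, nxt in transitions["random"] if nxt == "forced")
    q = sum(pr for pr, _, nxt in transitions["forced"] if nxt == "random")
    stationary = {"random": q / (p + q), "forced": p / (p + q)}
    return sum(stationary[s] * pr * cost
               for s in transitions for pr, cost, _ in transitions[s])

print(average_cost(TRANSITIONS))  # prints -1/4
```

The chain spends one quarter of the games in the forced situation, where the player wins with certainty, and breaks even on average in the remaining games.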

We will compare the performance of the active LZ algorithm
against this opponent versus the performance of an algorithm
(which we call 'predictive LZ') based on the Lempel-Ziv
predictor of Martinian [12]. Here, we use the Lempel-Ziv
algorithm to predict the opponent's most likely next play based
on his history, and play the best response. Since Lempel-Ziv
offers both strong theoretical guarantees and impressive
practical performance for the closely related problems of
compression and prediction, we would expect this algorithm
to be effective at detecting and exploiting non-random
behavior of the opponent. Note, however, that such an algorithm
is myopic in that it is always optimizing one-step costs and
does not factor in the effect of its actions on the opponent's
future play.

In Figure 1, we can see the relative performance of the
two algorithms. The predictive LZ algorithm is able to make
some modest improvements but gets stuck at a fixed level
of performance that is well below optimum. The active LZ
algorithm, on the other hand, is able to make consistent
improvements. The time required for convergence to the
optimal cost does, however, appear to be substantial.

IV. Analysis 

We now proceed to analyze the active LZ algorithm. In
particular, our main theorem, Theorem 2, will show that the
average cost incurred upon employing the active LZ algorithm
will equal the optimal average cost, starting at any state.

A. Preliminaries 

We begin with some notation. Recall that, for each $c \ge 1$,
$\tau_c$ is the starting time of the $c$th phrase, with $\tau_1 = 1$. Define
$c(t)$ to be the index of the current phrase at time $t$, so that

$$c(t) = \sup \{ c \ge 1 : \tau_c \le t \}.$$

Fig. 1. Performance of the active LZ algorithm on Rock-Paper-Scissors relative to the predictive LZ algorithm and the optimal policy.

At time $t$, the current context will be $(X_{\tau_{c(t)}}^t, A_{\tau_{c(t)}}^{t-1})$. We
define the length of the context at time $t$ to be $d(t) = t - \tau_{c(t)} + 1$.



The active LZ algorithm maintains context counts $N$,
probability estimates $\hat{P}$, and cost-to-go estimates $\hat{J}$. All of
these evolve over time. In order to highlight this dependence,
we denote by $N_t$, $\hat{P}_t$, and $\hat{J}_t$, respectively, the context counts,
probability estimates, and cost-to-go function estimates at time
$t$.

Given two probability distributions $p$ and $q$ over $\mathcal{X}$, define
$\mathrm{TV}(p, q)$ to be the total variation distance

$$\mathrm{TV}(p, q) = \frac{1}{2} \sum_{x} |p(x) - q(x)|.$$
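
For concreteness, the total variation distance between two distributions on a finite set can be computed as in the following sketch. The function name `total_variation` and the list representation of distributions are assumptions of this illustration.

```python
def total_variation(p, q):
    """Half the L1 distance between two distributions given as equal-length lists."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
```

The distance is 0 for identical distributions and 1 for distributions with disjoint supports.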



B. A Dynamic Programming Lemma 

Our analysis rests on a dynamic programming lemma. This
lemma provides conditions on the accuracy of the probability
estimates $\hat{P}_t$ at time $t$ that, if satisfied, guarantee that actions
generated by acting greedily with respect to $\hat{P}_t$ and $\hat{J}_t$ are
optimal. It relies heavily on the fact that the optimal cost-to-go
function can be computed by a value iteration procedure
that is very similar to the update for $\hat{J}_t$ employed in the active
LZ algorithm.

Lemma 1. Under the active LZ algorithm, there exist constants
$\bar{K} \ge 1$ and $\bar{\epsilon} \in (0, 1)$ so that the following holds: Suppose
that, at any time $t \ge K$, when the current context is
$(X_{\tau_{c(t)}}^t, A_{\tau_{c(t)}}^{t-1}) = (x^s, a^{s-1})$, we have

(i) The length $s = d(t)$ of the current context is at least $K$.

(ii) For all $\ell$ with $s \le \ell \le s + \bar{K}$ and all $(x_{s+1}^\ell, a_s^{\ell-1})$, the
context $(x^\ell, a^{\ell-1})$ has been visited at least once prior to
time $t$.

(iii) For all $\ell$ with $s \le \ell \le s + \bar{K}$ and all $(x_{s+1}^\ell, a_s^\ell)$, the
distribution $\hat{P}_t(\cdot \mid x^\ell, a^\ell)$ satisfies

$$\mathrm{TV}\big(\hat{P}_t(\cdot \mid x^\ell, a^\ell),\, P(\cdot \mid x_{\ell-K+1}^\ell, a_{\ell-K+1}^\ell)\big) \le \bar{\epsilon}.$$

Then, the action selected by acting greedily with respect to
$\hat{P}_t$ and $\hat{J}_t$ at time $t$ (as in line 10 of the active LZ algorithm)
is $\alpha$-discounted optimal. That is, such an action is contained
in the set of actions $\mathcal{A}_\alpha^*(x_{s-K+1}^s, a_{s-K+1}^{s-1})$.

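Proof: First, note that there exists a constant $\epsilon > 0$ so that
if $\tilde{P} : \mathcal{X} \times \mathcal{X}^K \times \mathcal{A}^K \to [0, 1]$ and $\tilde{J} : \mathcal{X}^K \times \mathcal{A}^{K-1} \to \mathbb{R}$ are two
arbitrary functions with

$$\big\| \tilde{P}(\cdot \mid x^K, a^K) - P(\cdot \mid x^K, a^K) \big\|_1 < \epsilon, \quad \forall\, x^K, a^K, \tag{7}$$

$$\big| \tilde{J}(x^K, a^{K-1}) - J_\alpha^*(x^K, a^{K-1}) \big| < \epsilon, \quad \forall\, x^K, a^{K-1}, \tag{8}$$

then acting greedily with respect to $(\tilde{P}, \tilde{J})$ results in actions
that are also optimal with respect to $(P, J_\alpha^*)$, that is, an
optimal policy. The existence of such an $\epsilon$ follows from the
finiteness of the observation and action spaces.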

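Now, suppose that, at time $t$, the hypotheses of the lemma
hold for some $(\bar{\epsilon}, \bar{K})$, and that the current context is $(x^s, a^{s-1})$,
with $s = d(t)$. If we can demonstrate that, for every $a_s \in \mathcal{A}$,

$$\sum_{x_{s+1}} \big| \hat{P}_t(x_{s+1} \mid x^s, a^s) - P(x_{s+1} \mid x_{s-K+1}^s, a_{s-K+1}^s) \big| < \epsilon, \tag{9}$$

and

$$\max_{x_{s+1}} \big| \hat{J}_t(x^{s+1}, a^s) - J_\alpha^*(x_{s-K+2}^{s+1}, a_{s-K+2}^s) \big| < \epsilon, \tag{10}$$

then, by the discussion above, the conclusion of the lemma
holds. Since the $\ell_1$ distance is twice the total variation distance,
(9) is immediate from our hypotheses if $\bar{\epsilon} < \epsilon/2$.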

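It remains to establish (10). In order to do so, fix a choice
of $x_{s+1}$ and $a_s$. To simplify notation in what follows, we will
suppress the dependence of certain probabilities, costs, and
value functions on $(x^{s+1}, a^s)$. In particular, for all $x_{s+2}$ and
$a_{s+1}$, define

$$\hat{P}_t(x_{s+2} \mid a_{s+1}) = \hat{P}_t(x_{s+2} \mid x^{s+1}, a^{s+1}),$$

$$P(x_{s+2} \mid a_{s+1}) = P(x_{s+2} \mid x_{s-K+2}^{s+1}, a_{s-K+2}^{s+1}).$$

These are, respectively, estimated and true transition
probabilities. Define

$$g_t(a_{s+1}, x_{s+2}) = g(x_{s+1}, a_{s+1}, x_{s+2})$$

to be the current cost, and define the value functions

$$\hat{J}_t(x_{s+2}, a_{s+1}) = \hat{J}_t(x^{s+2}, a^{s+1}), \qquad J_\alpha^*(x_{s+2}, a_{s+1}) = J_\alpha^*(x_{s-K+3}^{s+2}, a_{s-K+3}^{s+1}).$$

Then, using the fact that $J_\alpha^*$ solves the Bellman equation
(3) and the recursive definition of $\hat{J}_t$ (line 16 in the active LZ
algorithm), we have

$$\hat{J}_t(x^{s+1}, a^s) = \min_{a_{s+1}} \sum_{x_{s+2}} \hat{P}_t(x_{s+2} \mid a_{s+1}) \left[ g_t(a_{s+1}, x_{s+2}) + \alpha \hat{J}_t(x_{s+2}, a_{s+1}) \right],$$

$$J_\alpha^*(x_{s-K+2}^{s+1}, a_{s-K+2}^s) = \min_{a_{s+1}} \sum_{x_{s+2}} P(x_{s+2} \mid a_{s+1}) \left[ g_t(a_{s+1}, x_{s+2}) + \alpha J_\alpha^*(x_{s+2}, a_{s+1}) \right].$$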



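Observe that, for any $v, w : \mathcal{A} \to \mathbb{R}$,

$$\Big| \min_a v(a) - \min_a w(a) \Big| \le \max_a |v(a) - w(a)|.$$

Then,

$$\big| \hat{J}_t(x^{s+1}, a^s) - J_\alpha^*(x_{s-K+2}^{s+1}, a_{s-K+2}^s) \big| \le \max_{a_{s+1}} \bigg| \sum_{x_{s+2}} \hat{P}_t(x_{s+2} \mid a_{s+1}) \left[ g_t(a_{s+1}, x_{s+2}) + \alpha \hat{J}_t(x_{s+2}, a_{s+1}) \right] - \sum_{x_{s+2}} P(x_{s+2} \mid a_{s+1}) \left[ g_t(a_{s+1}, x_{s+2}) + \alpha J_\alpha^*(x_{s+2}, a_{s+1}) \right] \bigg|.$$

It follows that

$$\big| \hat{J}_t(x^{s+1}, a^s) - J_\alpha^*(x_{s-K+2}^{s+1}, a_{s-K+2}^s) \big| \le \max_{a_{s+1}} \sum_{x_{s+2}} \big| g_t(a_{s+1}, x_{s+2}) + \alpha \hat{J}_t(x_{s+2}, a_{s+1}) \big| \times \big| \hat{P}_t(x_{s+2} \mid a_{s+1}) - P(x_{s+2} \mid a_{s+1}) \big| + \alpha \max_{a_{s+1}} \sum_{x_{s+2}} P(x_{s+2} \mid a_{s+1}) \big| J_\alpha^*(x_{s+2}, a_{s+1}) - \hat{J}_t(x_{s+2}, a_{s+1}) \big|.$$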



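Using the fact that $|\hat{J}_t| \le g_{\max}/(1 - \alpha)$, since it represents a
discounted sum, together with hypothesis (iii) of the lemma,

$$\big| \hat{J}_t(x^{s+1}, a^s) - J_\alpha^*(x_{s-K+2}^{s+1}, a_{s-K+2}^s) \big| \le \frac{2 g_{\max} \bar{\epsilon}}{1 - \alpha} + \alpha \max_{x_{s+2},\, a_{s+1}} \big| J_\alpha^*(x_{s+2}, a_{s+1}) - \hat{J}_t(x_{s+2}, a_{s+1}) \big|.$$

We can repeat this same analysis on the $|J_\alpha^*(x_{s+2}, a_{s+1}) - \hat{J}_t(x_{s+2}, a_{s+1})|$
term. Continuing this $\bar{K}$ times, we reach the
expression

$$\big| \hat{J}_t(x^{s+1}, a^s) - J_\alpha^*(x_{s-K+2}^{s+1}, a_{s-K+2}^s) \big| \le \frac{2 g_{\max} \bar{\epsilon}}{1 - \alpha} \sum_{\ell=0}^{\bar{K}-1} \alpha^\ell + \alpha^{\bar{K}} \frac{2 g_{\max}}{1 - \alpha}. \tag{11}$$

It is clear that we can pick $\bar{\epsilon}$ sufficiently small and $\bar{K}$
sufficiently large so that $\bar{\epsilon} < \epsilon/2$ and the right-hand side
of (11) is less than $\epsilon$. Such a choice will ensure that (9)-(10)
hold, and hence the requirements of the lemma. ■
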
Lemma 1 provides sufficient conditions to guarantee when
the active LZ algorithm can be expected to select the correct
action given a current context of $(x^s, a^{s-1})$. The sufficient
conditions are a requirement on the length of the current context,
and on the context counts and probability estimates over all
contexts (up to a certain length) that have $(x^s, a^{s-1})$ as a
prefix.

We would like to characterize when these conditions hold. Motivated by Lemma 1, we define the following events for ease of exposition:

Definition 1 ($\bar\epsilon$-One-Step Inaccuracy). Define $I_t^{\bar\epsilon}$ to be the event that, at time $t$, at least one of the following holds:

(i) $\mathrm{TV}\left( \hat{P}_t(\cdot \mid X^t_{\tau_{c(t)}}, A^t_{\tau_{c(t)}}),\ P(\cdot \mid X^t_{t-K+1}, A^t_{t-K+1}) \right) > \bar\epsilon$.

(ii) The current context $(X^t_{\tau_{c(t)}}, A^{t-1}_{\tau_{c(t)}})$ has never been visited prior to time $t$.

If the event $I_t^{\bar\epsilon}$ holds, then at time $t$ the algorithm either possesses an estimate of the next-step transition probability $\hat{P}_t(\cdot \mid X^t_{\tau_{c(t)}}, A^t_{\tau_{c(t)}})$ that is more than $\bar\epsilon$ inaccurate relative to the true transition probabilities, under the total variation metric, or else these probabilities have never been updated from their initial values.
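As an illustration of Definition 1, the sketch below tests the two conditions for a single context. It assumes distributions are stored as dictionaries from symbols to probabilities and that a visit count is kept per context; the function names and this representation are hypothetical choices made for the example, not the data structures of the paper.

```python
def total_variation(p, q):
    """Total variation distance: half the L1 distance between p and q."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


def one_step_inaccurate(p_hat, p_true, eps, visits):
    """True if the context is unvisited or its estimate is off by more than eps."""
    return visits == 0 or total_variation(p_hat, p_true) > eps


# Rock-Paper-Scissors style example with a hypothetical true distribution.
p_true = {"R": 0.5, "P": 0.3, "S": 0.2}
p_hat = {"R": 0.45, "P": 0.35, "S": 0.2}
flag_visited = one_step_inaccurate(p_hat, p_true, eps=0.1, visits=12)
flag_unvisited = one_step_inaccurate(p_hat, p_true, eps=0.1, visits=0)
```

Here the estimate is within total variation 0.05 of the truth, so the event only holds in the second call, where the context has never been visited.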

Definition 2 ($(\bar\epsilon, \bar{K})$-Inaccuracy). Define $B_t^{\bar\epsilon, \bar{K}}$ to be the event that, at time $t \ge K$, either

(i) The length $d(t)$ of the current context is less than $K$.

(ii) There exist $\ell$ and $(x^\ell, a^\ell)$ such that

(a) $d(t) \le \ell \le d(t) + \bar{K}$.

(b) $(x^\ell, a^\ell)$ contains the current context $(X^t_{\tau_{c(t)}}, A^{t-1}_{\tau_{c(t)}})$ as a prefix, that is,

$$x^{d(t)} = X^t_{\tau_{c(t)}}, \qquad a^{d(t)-1} = A^{t-1}_{\tau_{c(t)}}.$$

(c) The estimated transition probabilities $\hat{P}_t(\cdot \mid x^\ell, a^\ell)$ are more than $\bar\epsilon$ inaccurate, under the total variation metric, and/or the context $(x^\ell, a^{\ell-1})$ has never been visited prior to time $t$.

From Lemma 1 it follows that if the event $B_t^{\bar\epsilon, \bar{K}}$ does not hold, then the algorithm has sufficiently accurate probability estimates to make an optimal decision at time $t$.

Our analysis of the active LZ algorithm proceeds in two broad steps:

1) In Section IV-C, we establish that $\bar\epsilon$-one-step inaccuracy occurs a vanishing fraction of the time. Next, we show that this, in fact, suffices to establish that $(\bar\epsilon, \bar{K})$-inaccuracy also occurs a vanishing fraction of the time. By Lemma 1, this implies that, when the algorithm chooses to exploit, the selected action is sub-optimal only a vanishing fraction of the time.

2) In Section IV-D, by further controlling the exploration rate appropriately, we use these results to conclude that the algorithm attains the optimal average cost.



C. Approximating Transition Probabilities 

We digress briefly to discuss a result from the theory of universal prediction: given an arbitrary sequence $\{y_t\}$, with $y_t \in \mathbb{Y}$ for some finite alphabet $\mathbb{Y}$, consider the problem of making sequential probability assignments $Q_{t-1}(\cdot)$ over $\mathbb{Y}$, given the entire sequence observed up to and including time $t-1$, $y^{t-1}$, so as to minimize the cost function $\sum_{t=1}^{T} -\log Q_{t-1}(y_t)$, for some horizon $T$. The following was shown by Krichevsky and Trofimov [11].

Lemma 2. The assignment

$$Q_t(y) \triangleq \frac{N_t(y) + 1/2}{t + |\mathbb{Y}|/2}, \tag{12}$$

where $N_t(y)$ is the number of occurrences of the symbol $y$ up to time $t$, achieves

$$\sum_{t=1}^{T} -\log Q_{t-1}(y_t) - \min_{q \in \mathcal{M}(\mathbb{Y})} \sum_{t=1}^{T} -\log q(y_t) \le \frac{|\mathbb{Y}|}{2} \log T + O(1),$$

where the minimization is taken over the set $\mathcal{M}(\mathbb{Y})$ of all probability distributions on $\mathbb{Y}$.

Lemma 2 provides a bound on the performance of the sequential probability assignment (12) versus the performance of the best constant probability assignment, made with knowledge of the full sequence $y^T$. Notice that (12) is precisely the one-step transition probability estimate employed at each context by the active LZ algorithm (see the probability-estimate update in the algorithm listing).
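To make the Krichevsky-Trofimov assignment concrete, here is a minimal sketch that computes the add-1/2 estimate and the log-loss regret against the best constant distribution chosen in hindsight. The function names are hypothetical, natural logarithms are used, and the random test sequence is only an illustration; nothing here is taken from the paper's experiments.

```python
import math
import random
from collections import Counter


def kt_estimate(counts, t, alphabet):
    """KT estimate (N_t(y) + 1/2) / (t + |Y|/2) for every symbol y."""
    m = len(alphabet)
    return {y: (counts[y] + 0.5) / (t + m / 2) for y in alphabet}


def kt_regret(seq, alphabet):
    """Log-loss of sequential KT prediction minus that of the best constant q."""
    counts = Counter()
    loss = 0.0
    for t, y in enumerate(seq):  # t symbols have been seen before y
        q = kt_estimate(counts, t, alphabet)
        loss += -math.log(q[y])
        counts[y] += 1
    total = len(seq)
    # The best constant distribution in hindsight is the empirical one.
    best = -sum(n * math.log(n / total) for n in counts.values() if n > 0)
    return loss - best


rng = random.Random(0)
alphabet = ["R", "P", "S"]
seq = [rng.choice(alphabet) for _ in range(500)]
regret = kt_regret(seq, alphabet)
```

For a sequence of length 500 over three symbols, the regret stays within a logarithmic envelope, which is the behavior the lemma describes.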

Returning to our original setting, define $p_{\min}$ to be the smallest element of the set of non-zero transition probabilities

$$\left\{ P(x_{K+1} \mid x^K, a^K) \ :\ P(x_{K+1} \mid x^K, a^K) > 0 \right\}.$$

The proof of the following lemma essentially involves invoking Lemma 2 at each context encountered by the algorithm, the use of a combinatorial lemma (Ziv's inequality), and the use of the Azuma-Hoeffding inequality (see, for example, [13]). Part of the proof is motivated by results on Lempel-Ziv based prediction obtained by Feder et al. [14].

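Lemma 3. For arbitrary $\epsilon' > 0$,

$$\Pr\left( \frac{1}{T} \sum_{t=K}^{T} \mathbb{I}_{I_t^{\bar\epsilon}} \ge \frac{K_1 \log\log T}{2 \bar\epsilon^2 \log T} + \frac{\epsilon'}{2 \bar\epsilon^2} \right) \le \exp\left( - \frac{T \epsilon'^2}{8 \log^2\left( (2T + |\mathbb{X}|)/p_{\min} \right)} \right),$$

where $K_1$ is a constant that depends only on $|\mathbb{X}|$ and $|\mathbb{A}|$.

Proof: See Appendix A. ■

Lemma 3 controls the fraction of the time that the active LZ algorithm is $\bar\epsilon$-one-step inaccurate. In particular, Lemma 3 is sufficient to establish that this fraction of time goes to 0 (via a use of the first Borel-Cantelli lemma) and also gives us a rate of convergence.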

It turns out that if the exploration rate $\gamma_t$ decays sufficiently slowly, this suffices to ensure that the fraction of time the algorithm is $(\bar\epsilon, \bar{K})$-inaccurate goes to 0 as well. To see this, suppose that the current context at time $t$ is $(X^t_{\tau_{c(t)}}, A^{t-1}_{\tau_{c(t)}}) = (x^s, a^{s-1})$, and that the algorithm is $(\bar\epsilon, \bar{K})$-inaccurate (i.e., the event $B_t^{\bar\epsilon, \bar{K}}$ holds). Then, one of two things must be the case:

• The current context length $s$ is less than $K$. We will demonstrate that this happens only a vanishing fraction of the time.

• There exists $(x^\ell, a^\ell)$, with $s \le \ell \le s + \bar{K}$, so that either the estimated transition probability distribution $\hat{P}_t(\cdot \mid x^\ell, a^\ell)$ is $\bar\epsilon$ inaccurate under the total variation metric, or the context $(x^\ell, a^{\ell-1})$ has never been visited in the past. The probability that the realized sequence of future observations and actions $(X^{t+\ell-s}_{t+1}, A^{t+\ell-s}_{t})$ will indeed correspond to $(x^\ell_{s+1}, a^\ell_s)$ is at least

$$p_{\min}^{\ell-s+1} \prod_{m=t}^{t+\ell-s} \gamma_m,$$

where $p_{\min}$ is the smallest non-zero transition probability. Thus, with this minimum probability, an $\bar\epsilon$-one-step inaccurate time will occur before the time $t + \bar{K}$. Then, if the exploration probabilities $\{\gamma_m\}$ decay sufficiently slowly, it would be impossible for the fraction of $\bar\epsilon$-one-step inaccurate times to go to 0 without the fraction of $(\bar\epsilon, \bar{K})$-inaccurate times also going to 0.

By making these arguments precise we can prove the following lemma. The lemma states that the fraction of time we are at a context wherein the assumptions of Lemma 1 are not satisfied goes to 0 almost surely.

Lemma 4. Assume that

$$\gamma_t \ge (a_1/\log t)^{1/(a_2(\bar{K}+1))},$$

for arbitrary constants $a_1 > 0$ and $a_2 > 1$. Further assume that $\{\gamma_t\}$ is non-increasing. Then,

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=K}^{T} \mathbb{I}_{B_t^{\bar\epsilon, \bar{K}}} = 0, \quad \text{a.s.}$$


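Proof: First, we consider the instances of time where the current context length is less than $K$. Note that

$$\sum_{t=K}^{T} \mathbb{I}_{\{d(t) < K\}} \le \sum_{c=1}^{c(T)} \sum_{t=\tau_c}^{\tau_{c+1}-1} \mathbb{I}_{\{t - \tau_c + 1 < K\}} \le \sum_{c=1}^{c(T)} K = K c(T).$$

Applying Ziv's inequality (Lemma 5),

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=K}^{T} \mathbb{I}_{\{d(t) < K\}} \le \lim_{T \to \infty} \frac{K C_2}{\log T} = 0. \tag{13}$$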



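Next, define $B_t$ to be the event that an $\bar\epsilon$-one-step inaccurate time occurs between $t$ and $t + \bar{K}$ inclusive, that is

$$B_t \triangleq \bigcup_{s=t}^{t+\bar{K}} I_s^{\bar\epsilon}.$$

It is easy to see that

$$\sum_{t=K}^{T} \mathbb{I}_{B_t} \le (\bar{K}+1) \sum_{t=K}^{T+\bar{K}} \mathbb{I}_{I_t^{\bar\epsilon}}.$$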

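From Lemma 3 we immediately have, for arbitrary $\epsilon' > 0$,

$$\Pr\left( \frac{1}{T} \sum_{t=K}^{T} \mathbb{I}_{B_t} \ge \frac{(\bar{K}+1) K_1 \log\log T}{2 \bar\epsilon^2 \log T} + \frac{(\bar{K}+1) \epsilon'}{2 \bar\epsilon^2} + \frac{(\bar{K}+1)^2}{T} \right) \le \exp\left( - \frac{T \epsilon'^2}{8 \log^2\left( (2T + |\mathbb{X}|)/p_{\min} \right)} \right). \tag{14}$$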



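Define $H_t$ to be the event that $B_t^{\bar\epsilon, \bar{K}}$ holds, but $d(t) \ge K$. The event $H_t$ holds when, at time $t$, there exists some context, up to $\bar{K}$ levels below the current context, which is $\bar\epsilon$-one-step inaccurate. Such a context will be visited with probability at least

$$p_{\min}^{\bar{K}+1} \prod_{m=t}^{t+\bar{K}} \gamma_m \ge (p_{\min} \gamma_{t+\bar{K}})^{\bar{K}+1},$$

in which case $B_t$ holds. Consequently,

$$\mathrm{E}[\mathbb{I}_{B_t} \mid \mathcal{F}_t] \ge (p_{\min} \gamma_{t+\bar{K}})^{\bar{K}+1}\, \mathbb{I}_{H_t}.$$

Since $\gamma_t$ is non-increasing,

$$\frac{1}{T} \sum_{t=K}^{T} \mathrm{E}[\mathbb{I}_{B_t} \mid \mathcal{F}_t] \ge (p_{\min} \gamma_{T+\bar{K}})^{\bar{K}+1}\, \frac{1}{T} \sum_{t=K}^{T} \mathbb{I}_{H_t}. \tag{15}$$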

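Now define, for $i = 0, 1, \ldots, \bar{K}-1$ and $n \ge 0$, martingales $M_n^{(i)}$ adapted to $\mathcal{G}_n^{(i)} = \mathcal{F}_{K + n\bar{K} + i}$, according to $M_0^{(i)} = 0$, and, for $n > 0$,

$$M_n^{(i)} = \sum_{j=0}^{n-1} \left( \mathbb{I}_{B_{K + j\bar{K} + i}} - \mathrm{E}\left[ \mathbb{I}_{B_{K + j\bar{K} + i}} \mid \mathcal{F}_{K + j\bar{K} + i} \right] \right).$$

Since $|M_n^{(i)} - M_{n-1}^{(i)}| \le 2$, we have via the Azuma-Hoeffding inequality, for arbitrary $\epsilon'' > 0$,

$$\Pr\left( M_n^{(i)} \ge n \epsilon'' \right) \le \exp\left( -n \epsilon''^2 / 8 \right). \tag{16}$$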
For each i, let rii (T) be the largest integer such that K + 



,{T)K + i < T, so that 



K-l 



^{T)■ 



t=K 



1=0 



Since ni(T) < the union bound along with ( fT6b then 
implies that: 



1 ^ 



< 



K +1 



T+K 



Pr K]lB, -E[lBj.Ft]>re 

\t=K 



t=K 



T 



i=0 
K~l 



(17) 



t=K 
T 



< 



K+l 



T 



t=K 



{K+lf 
T 



Ecxp(-r2e"V8XV(r) 



i=0 

< if exp (^-Te"^/8K 






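Now, define

$$\kappa(T) \triangleq \frac{1}{(p_{\min} \gamma_{T+\bar{K}})^{\bar{K}+1}} \left[ \frac{(\bar{K}+1) K_1 \log\log T}{2 \bar\epsilon^2 \log T} + \frac{(\bar{K}+1) \epsilon'(T)}{2 \bar\epsilon^2} + \frac{(\bar{K}+1)^2}{T} + \epsilon''(T) \right],$$

with

$$\epsilon'(T) \triangleq \frac{1}{\log T}, \qquad \epsilon''(T) \triangleq \frac{1}{\log T}.$$

It follows from (14), (15), and (17) that

$$\Pr\left( \frac{1}{T} \sum_{t=K}^{T} \mathbb{I}_{H_t} \ge \kappa(T) \right) \le \exp\left( - \frac{T}{8 \log^4\left( (2T + |\mathbb{X}|)/p_{\min} \right)} \right) + \bar{K} \exp\left( - \frac{T}{8 \bar{K} \log^2 T} \right).$$
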
By the first Borel-Cantelli lemma, 

Note that the hypothesis on $\gamma_t$ implies that $\kappa(T) \to 0$ as $T \to \infty$. Then,



T-l 



lim - y I/j = 1, 



a.s. 



(18) 



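Finally, note that

$$\frac{1}{T} \sum_{t=K}^{T} \mathbb{I}_{B_t^{\bar\epsilon, \bar{K}}} \le \frac{1}{T} \sum_{t=K}^{T} \mathbb{I}_{\{d(t) < K\}} + \frac{1}{T} \sum_{t=K}^{T} \mathbb{I}_{H_t}.$$

The result then follows from (13) and (18). ■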



D. Average Cost Optimality 

Observe that if the active LZ algorithm chooses an action that is non-optimal at time $t$, that is,

$$A_t \ne A^*_\alpha(X^t, A^{t-1}),$$

then either the event $B_t^{\bar\epsilon, \bar{K}}$ holds or the algorithm chose to explore. Lemma 4 guarantees that the first possibility happens a vanishing fraction of time. Further, if $\gamma_t \downarrow 0$, then the algorithm will explore a vanishing fraction of time. Combining these observations gives us the following theorem.

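Theorem 1. Assume that

$$\gamma_t \ge (a_1/\log t)^{1/(a_2(\bar{K}+1))},$$

for arbitrary constants $a_1 > 0$ and $a_2 > 1$. Further, assume that $\gamma_t \downarrow 0$. Then,

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=K}^{T} \mathbb{I}_{\{A_t \ne A^*_\alpha(X^t, A^{t-1})\}} = 0, \quad \text{a.s.}$$
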

Proof: Given a sequence of independent bounded random variables $\{Z_n\}$, with $\mathrm{E}[Z_n] \to 0$,

$$\lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} Z_n = 0, \quad \text{a.s.}$$

This follows, for example, from the Azuma-Hoeffding inequality followed by the first Borel-Cantelli lemma. This immediately yields

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=K}^{T} \mathbb{I}_{\{\text{exploration at time } t\}} = 0, \quad \text{a.s.}, \tag{19}$$

provided $\gamma_t \to 0$ (note that the choice of exploration at each time $t$ is independent of all other events). Now observe that

$$\left\{ A_t \ne A^*_\alpha(X^t, A^{t-1}) \right\} \subset B_t^{\bar\epsilon, \bar{K}} \cup \{\text{exploration at time } t\}.$$

Combining (19) with Lemma 4, the result follows. ■
Assumption 1 guarantees the optimal average cost is $\lambda^*$, independent of the initial state of the Markov chain, and that there exists a stationary policy that achieves the optimal average cost $\lambda^*$. By the ergodicity theorem, under such an optimal policy,

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} g(X_t, A_t, X_{t+1}) = \lambda^*, \quad \text{a.s.} \tag{20}$$

On the other hand, Theorem 1 suggests that, under the active LZ algorithm, the fraction of time at which non-optimal decisions are made vanishes asymptotically. Combining these facts yields our main result.

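Theorem 2. Assume that

$$\gamma_t \ge (a_1/\log t)^{1/(a_2(\bar{K}+1))},$$

for arbitrary constants $a_1 > 0$ and $a_2 > 1$, and that $\gamma_t \downarrow 0$. Then, for $\alpha \in (0, 1)$ sufficiently close to 1,

$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} g(X_t, A_t, X_{t+1}) = \lambda^*, \quad \text{a.s.},$$

under the active LZ algorithm. Hence, the active LZ algorithm achieves an asymptotically optimal average cost regardless of the underlying transition kernel.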
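To get a feel for how slowly the exploration probabilities are allowed to decay, the sketch below evaluates a lower envelope of the form $(a_1/\log t)^{1/(a_2(\bar{K}+1))}$, clipped to 1. The exact exponent is read here as an assumption about the condition in the theorem, and the function name and the default constants are hypothetical values chosen for illustration only.

```python
import math


def exploration_lower_bound(t, a1=1.0, a2=2.0, k_bar=3):
    """Lower envelope for the exploration probability at time t.

    Assumes the condition gamma_t >= (a1 / log t)^(1 / (a2 * (k_bar + 1)));
    the value is clipped to 1 because gamma_t is a probability.
    """
    if t < 2:
        return 1.0
    return min(1.0, (a1 / math.log(t)) ** (1.0 / (a2 * (k_bar + 1))))


# The envelope decays extremely slowly, even over astronomically long horizons.
gamma_small = exploration_lower_bound(10 ** 6)
gamma_large = exploration_lower_bound(10 ** 100)
```

Under these assumed constants the envelope is still above one half at $t = 10^{100}$, which illustrates why the guarantee is asymptotic and why the convergence rate is slow.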

Proof: Without loss of generality, assume that the cost $g(X_t, A_t, X_{t+1})$ does not depend on $X_{t+1}$.

Fix $\epsilon > 0$, and consider an interval of time $T_\epsilon > K$. For each $(x^K, a^K) \in \mathbb{X}^K \times \mathbb{A}^K$, define a coupled process $(X_t(x^K, a^K), A_t(x^K, a^K))$ as follows. For every integer $n$, set

$$X^{(n-1)T_\epsilon + K}_{(n-1)T_\epsilon + 1}(x^K, a^K) = x^K_1$$

and

$$A^{(n-1)T_\epsilon + K}_{(n-1)T_\epsilon + 1}(x^K, a^K) = a^K_1.$$


For all other times $t$, the coupled processes will choose actions according to an optimal stationary policy. Without loss of generality, we will assume that the choice of action is unique.

Now, for each $n$ there will be exactly one $(x^K, a^K)$ that matches the original process $(X_t, A_t)$ over times $(n-1)T_\epsilon + 1 \le t \le (n-1)T_\epsilon + K$, that is,

$$(x^K, a^K) = \left( X^{(n-1)T_\epsilon + K}_{(n-1)T_\epsilon + 1},\ A^{(n-1)T_\epsilon + K}_{(n-1)T_\epsilon + 1} \right).$$

For the process indexed by $(x^K, a^K)$, for $(n-1)T_\epsilon + K < t \le nT_\epsilon$, if

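then set $X_t(x^K, a^K) = X_t$. Otherwise, allow $X_t(x^K, a^K)$ to evolve independently according to the process transition probabilities. Similarly, allow all the other processes to evolve independently according to the proper transition probabilities. Define

$$G_n(x^K, a^K) \triangleq \frac{1}{T_\epsilon} \sum_{t=(n-1)T_\epsilon + 1}^{nT_\epsilon} g\left( X_t(x^K, a^K), A_t(x^K, a^K) \right).$$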



Note that each $G_n(x^K, a^K)$ is the average cost under an optimal policy. Therefore, because of (20), we can pick $T_\epsilon$ large enough so that for any $n$,

$$\mathrm{E}\left[ \max_{(x^K, a^K)} \left| G_n(x^K, a^K) - \lambda^* \right| \right] < \epsilon. \tag{21}$$



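Define $Z_n$ to be the event that, within the $n$th interval, the algorithm chooses a non-optimal action. That is,

$$Z_n \triangleq \left\{ \exists\, t,\ (n-1)T_\epsilon < t \le nT_\epsilon,\ A_t \ne A^*_\alpha(X^t, A^{t-1}) \right\}.$$

Set

$$E_N \triangleq \sum_{n=1}^{N} \mathbb{I}_{Z_n}.$$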

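Then,

$$\left| \frac{1}{N T_\epsilon} \sum_{t=1}^{N T_\epsilon} \left( g(X_t, A_t) - \lambda^* \right) \right| \le \frac{2 g_{\max} E_N}{N} + \frac{1}{N} \sum_{n=1}^{N} (1 - \mathbb{I}_{Z_n}) \left| \frac{1}{T_\epsilon} \sum_{t=(n-1)T_\epsilon + 1}^{nT_\epsilon} \left( g(X_t, A_t) - \lambda^* \right) \right|.$$

Note that, from Theorem 1, $E_N/N \to 0$ almost surely as $N \to \infty$. Thus,

$$\limsup_{N \to \infty} \left| \frac{1}{N T_\epsilon} \sum_{t=1}^{N T_\epsilon} \left( g(X_t, A_t) - \lambda^* \right) \right| \le \limsup_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} (1 - \mathbb{I}_{Z_n}) \left| \frac{1}{T_\epsilon} \sum_{t=(n-1)T_\epsilon + 1}^{nT_\epsilon} \left( g(X_t, A_t) - \lambda^* \right) \right|.$$

Notice that when $\mathbb{I}_{Z_n} = 0$, we have for some $(x^K, a^K)$ that $X_t(x^K, a^K) = X_t$ for all $(n-1)T_\epsilon < t \le nT_\epsilon$. Thus,

$$\limsup_{N \to \infty} \left| \frac{1}{N T_\epsilon} \sum_{t=1}^{N T_\epsilon} \left( g(X_t, A_t) - \lambda^* \right) \right| \le \limsup_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} (1 - \mathbb{I}_{Z_n}) \max_{(x^K, a^K)} \left| G_n(x^K, a^K) - \lambda^* \right| \le \limsup_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} \max_{(x^K, a^K)} \left| G_n(x^K, a^K) - \lambda^* \right|.$$

However, the variables

$$\max_{(x^K, a^K)} \left| G_n(x^K, a^K) - \lambda^* \right|$$

are independent and identically distributed as $n$ varies. Thus, by the Strong Law of Large Numbers and (21),

$$\limsup_{T \to \infty} \left| \frac{1}{T} \sum_{t=1}^{T} \left( g(X_t, A_t) - \lambda^* \right) \right| \le \epsilon,$$

with probability 1. Since $\epsilon$ was arbitrary, the result follows. ■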



E. Choice of Discount Factor 

Given a choice of $\alpha$ sufficiently close to 1, the optimal $\alpha$-discounted cost policy coincides with the average cost optimal policy. Our presentation thus far has assumed knowledge of such an $\alpha$. For a given $\alpha$, under the assumptions of Theorem 1, the active LZ algorithm is guaranteed to take $\alpha$-discounted optimal actions a fraction 1 of the time, which for an ad hoc choice of $\alpha$ sufficiently close to 1 is likely to yield good performance. Nonetheless, one may use a 'doubling trick' in conjunction with the active LZ algorithm to attain average cost optimality without knowledge of $\alpha$. In particular, consider the following algorithm that uses the active LZ algorithm, with the choice of $\{\gamma_t\}$ as stipulated by Theorem 1, as a subroutine:

Algorithm 2 The active LZ with a doubling scheme.
1: for non-negative integers $k$ do
2:   for each time $2^k \le t' < 2^{k+1}$ do
3:     Apply the active LZ algorithm (Algorithm 1) with $\alpha = 1 - \beta_k$, and time index $t = t' - 2^k$.
4:   end for
5: end for
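The bookkeeping of the doubling scheme can be sketched as follows: each global time $t'$ is mapped to its epoch $k$, the local time index $t = t' - 2^k$, and the discount factor $\alpha = 1 - \beta_k$ used in that epoch. The function names are hypothetical, and `example_beta` is an arbitrary slowly vanishing sequence chosen for illustration, not the sequence prescribed by the paper.

```python
import math


def epoch_schedule(t_prime, beta):
    """Map global time t' >= 1 to (epoch k, local time t, discount alpha)."""
    if t_prime < 1:
        raise ValueError("t_prime must be a positive integer")
    k = t_prime.bit_length() - 1  # largest k with 2**k <= t_prime
    t = t_prime - 2 ** k          # time index restarted within the epoch
    alpha = 1.0 - beta(k)
    return k, t, alpha


def example_beta(k):
    """Hypothetical sequence that tends to 0 very slowly (illustration only)."""
    return 1.0 / (2.0 + math.log(math.log(k + 3.0)))


# Epoch k covers the global times 2**k, ..., 2**(k+1) - 1.
schedule = [epoch_schedule(tp, example_beta) for tp in range(1, 9)]
```

In a full implementation the active LZ subroutine would be restarted with the new discount factor at the start of each epoch; this sketch only shows how the epoch, local time, and discount factor are derived from the global clock.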

Here $\beta_k$ is a sequence that approaches 0 sufficiently slowly. One can show that if $\beta_k = \Omega(1/\log\log k)$, then the above scheme achieves average cost optimality. A rigorous proof of this fact would require repetition of arguments we have used to prove earlier results. As such, we only provide a sketch that outlines the steps required to establish average cost optimality:

We begin by noting that in the $k$th epoch of Algorithm 2, one choice (so that Lemma 1 remains true) is to let $\bar\epsilon_k, \bar{K}_k$
grow as a approaches 1 according to ik — and 
Kk = respectively. If Pk = loglogfc), then 

for the $k$th epoch of Algorithm 2, Lemma 4 is easily modified to show that with high probability the greedy action is suboptimal over less than $2^k \kappa(2^k)$ time steps, where $\kappa(2^k) = O((\log\log 2^k)^3/\log 2^k)$. The Borel-Cantelli lemma may then be used to establish that beyond some finite epoch, over all subsequent epochs $k$, the greedy action is suboptimal over at most $2^k \kappa(2^k)$ time steps. Provided $\beta_k \to 0$, this suffices to show that the greedy action is optimal a fraction 1 of the time. Provided one decreases exploration probabilities sufficiently quickly, this in turn suffices to establish average cost optimality.

F. On the Rate of Convergence 

We limit our discussion to the rate at which the fraction of time the active LZ algorithm takes sub-optimal actions goes to zero; even assuming one selects optimal actions at every point in time, the rate at which the average costs incurred converge to $\lambda^*$ is intimately related to the structure of $P$, which is a somewhat separate issue. Now the proofs of Lemma 4 and Theorem 1 tell us that the fraction of time the active LZ algorithm selects sub-optimal actions goes to zero at a rate that is $O((1/\log T)^c)$, where $c$ is some constant less than 1. The proofs of Lemmas 3 and 4 reveal that the determining factor of this rate is effectively the rate at which the transition probability estimates provided by $\hat{P}$ converge to their true values. Thus, while the rate at which the fraction of sub-optimal action selections goes to zero is slow, this rate is not surprising and is shared with many Lempel-Ziv schemes used in prediction and compression.

A natural direction for further research is to explore the effect of replacing the LZ-based context tree data structure by the context-tree weighting method of Willems et al. [15]. It seems plausible to expect that such an approach will yield algorithms with significantly improved convergence rates, as is the case in data compression and prediction.
V. Conclusion

We have presented and established the asymptotic optimality of a Lempel-Ziv inspired algorithm for learning. The algorithm is a natural combination of ideas from information theory and dynamic programming. We hope that these ideas, in particular the use of a Lempel-Ziv tree to model an unknown probability distribution, can find other uses in reinforcement learning.

One interesting special case to consider is when the next observation is Markovian given the past $K$ observations and only the latest action. In this case, a variation of the active LZ algorithm that uses contexts of the form $(x^K, a)$ could be used. Here, the resulting tree would have exponentially fewer nodes and would be much quicker to converge to the optimal policy.

A number of further issues are under consideration. It would be of great interest to develop theoretical bounds for the rate of convergence. Also, it would be natural to extend the analysis of our algorithms to systems with possibly infinite dependence on history. One such extension would be to mixing models, such as those considered by Jacquet et al. [8]. Another would be to consider the optimal control of a partially observable Markov decision process.
Appendix A
Proof of Lemma 3

An important device in the proof of Lemma 3 is the following combinatorial lemma. A proof can be found in Cover and Thomas [16].

Lemma 5 (Ziv's Inequality). The number of contexts seen by time $T$, $c(T)$, satisfies

$$c(T) \le \frac{C_2 T}{\log T},$$

where $C_2$ is a constant that depends only on $|\mathbb{X}|$ and $|\mathbb{A}|$.
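The growth of the number of contexts can be illustrated with a plain LZ78-style incremental parse. This is a minimal sketch over a generic symbol sequence: in the active LZ algorithm the parsed symbols would be observation-action pairs, and the function name `lz78_phrase_count` is hypothetical.

```python
import math
import random


def lz78_phrase_count(seq):
    """Number of phrases in the LZ78 incremental parsing of seq.

    Each phrase is the shortest prefix of the remaining input that has not
    been seen as a phrase before; a trailing incomplete phrase is counted.
    """
    root = {}
    node = root
    count = 0
    for sym in seq:
        if sym in node:
            node = node[sym]   # phrase so far was seen before: go deeper
        else:
            node[sym] = {}     # new leaf closes the current phrase
            node = root
            count += 1
    if node is not root:
        count += 1             # input ended in the middle of a phrase
    return count


rng = random.Random(0)
T = 20000
seq = [rng.choice("RPS") for _ in range(T)]
c_T = lz78_phrase_count(seq)
ratio = c_T * math.log(T) / T  # stays bounded, in line with c(T) = O(T / log T)
```

Since every phrase is a distinct string, the number of phrases can grow only as fast as $T/\log T$ up to a constant, which is the content of Ziv's inequality.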

Without loss of generality, assume that $X_t$ and $A_t$ take some fixed but arbitrary values for $-K + 2 \le t \le 0$, so that the expression $P(X_{t+1} \mid X^t_{t-K+1}, A^t_{t-K+1})$ is well-defined for all $t \ge 1$. We will use Lemma 2 to show:

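Lemma 6.

$$\frac{1}{T} \sum_{t=1}^{T} \log \frac{P(X_{t+1} \mid X^t_{t-K+1}, A^t_{t-K+1})}{\hat{P}_t(X_{t+1} \mid X^t_{\tau_{c(t)}}, A^t_{\tau_{c(t)}})} \le K_1 \frac{\log\log T}{\log T},$$

where $K_1$ is a positive constant that depends only on $|\mathbb{X}|$ and $|\mathbb{A}|$.

Proof: Observe that the probability assignment made by our algorithm is equivalent to using (12) at every context. In particular, at every time $t$,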






Tsachy Weissman Tsachy Weissman obtained his 
undergraduate and graduate degrees from the De- 
partment of electrical engineering at the Technion. 
Following his graduation, he has held a faculty posi- 
tion at the Technion, and postdoctoral appointments 
with the Statistics Department at Stanford University 
and with Hewlett-Packard Laboratories. Since the 
summer of 2003 he has been on the faculty of the 
Department of Electrical Engineering at Stanford. 
Since the summer of 2007 he has also been with 
the Department of Electrical Engineering at the 
Technion, from which he is currently on leave. 

His research interests span information theory and its applications, and 
statistical signal processing. He is inventor or co-inventor of several patents 
in these areas and involved in a number of high-tech companies as a researcher 
or member of the technical board. 

His recent prizes include the NSF CAREER award, a Horev fellowship for 
leaders in Science and Technology, and the Henry Taub prize for excellence 
in research. He is a Robert N. Noyce Faculty Scholar of the School of 
Engineering at Stanford, and a recipient of the 2006 IEEE joint IT/COM 
societies best paper award. 



Pt{Xt+AXl^^^,A\^^^^) 

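For each $(x^j, a^j)$, define $\mathcal{T}_T(x^j, a^j)$ to be the set of times

$$\mathcal{T}_T(x^j, a^j) \triangleq \left\{ t \ :\ 1 \le t \le T,\ (X^t_{\tau_{c(t)}}, A^t_{\tau_{c(t)}}) = (x^j, a^j) \right\}.$$

It follows from Lemma 2 that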

< min - V \ogp{Xt+i) 



IXI 



loglTrfx^a-'ll+Ci. 



Summing this expression over all distinct $(x^j, a^j)$ that have occurred up to time $T$,

T 



J-T+\ as follows: set with Mg = 0, and, for T > 1, 
Mt = — ^ At ^ 



mm - V logp(Xt+i) 

teT{xi,a3) 



|X| 



(a:3,aJ) 

+ E 

T 

< -^l0gP(X,+i|X*_^+l,^*_^+l) 



EE 



t=l 



log 



log|rT(x^aJ)|+Ci 



(22) 



= E logl^TT 7* T + ^* • 

^ \ -f^l-^t+il^t-K+i'^t-x+iJ / 

It is clear that Mt is a martingale with E[AfT] = 0. Further, 

> \ogPt{Xt+^\Xl^^^,AU > log(l/(2t+ |X|)). 



E 



^\og\TT{x^,a^)\+Ci 



and 



> logP{Xt+i\XlK+i,AUK+,) > logp„ 



Now, $c(T)$ is the total number of distinct contexts that have occurred up to time $T$. Note that this is also precisely the number of distinct $(x^j, a^j)$ with $|\mathcal{T}_T(x^j, a^j)| > 0$. Then, by the concavity of $\log(\cdot)$,



so that 



\Mt - Mt-i \ < 2 log 



2T 



Pn 



An application of the Azuma-Hoeffding inequality then yields, 
for arbitrary e' > 0, 



E 

(2;J,aJ) 



yiog|TT(x^a-'")|+Ci 



< 



■"^(^)iog^ + c,c(r). 



Pr ( ^ > e' ) < exp 



< exp 



2J2 



c{T) 



8ELiW((2T+|X|)/p„in)^ 

IV^ 

'81og2((2T+|X|)M„in) 



Applying Lemma |5] 



E 

(xi ,a3) 

C2IXI T 



< 



[log log T- log C2] 



We are now ready to prove Lemma 3.
Lemma 3. For arbitrary $\epsilon' > 0$,



Pr 



2 logT 
T 

C1C2 



(23) 



i^iloglogT e' 
2^ logT 2^ 



< exp 



■logT 

The lemma follows by combining (22) and (23). ■
For the remainder of this section, define $\Delta_t$ to be the Kullback-Leibler distance between the estimated and true transition probabilities at time $t$, that is



Te'- 



81og^((2T+|X|)M„in) 



where $K_1$ is a constant that depends only on $|\mathbb{X}|$ and $|\mathbb{A}|$.
Proof: Define 

We have 



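$$\Delta_t \triangleq D\left( P(\cdot \mid X^t_{t-K+1}, A^t_{t-K+1}) \,\big\|\, \hat{P}_t(\cdot \mid X^t_{\tau_{c(t)}}, A^t_{\tau_{c(t)}}) \right).$$

Lemma 7. For arbitrary $\epsilon' > 0$,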



t=l 



\ t=i 



log ^^'■'"f '-■5-'.+ A. 



-P(-'^t+l l-'^t-A'+l, ^t-K+l) 



>^E%-^-}- 

t=i 



(24) 



< exp 



Te' 



12 



81og"((2T+|X|)/p„,i„) 



Here, the first inequality follows by the non-negativity of Kullback-Leibler distance. The second inequality follows from Pinsker's inequality, which states that $\mathrm{TV}(\cdot, \cdot) \le \sqrt{D(\cdot \| \cdot)/2}$.
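Pinsker's inequality is what links the log-loss bounds to total variation accuracy. The following sketch checks it numerically on random distributions; it assumes distributions are lists of strictly positive probabilities and uses natural logarithms. The helper names are hypothetical.

```python
import math
import random


def kl_divergence(p, q):
    """Kullback-Leibler distance D(p || q) in nats (q must be positive)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def total_variation(p, q):
    """Total variation distance: half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def random_distribution(rng, m):
    """Random strictly positive distribution on m symbols."""
    weights = [rng.uniform(0.05, 1.0) for _ in range(m)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
pairs = [(random_distribution(rng, 3), random_distribution(rng, 3))
         for _ in range(1000)]
pinsker_holds = all(
    total_variation(p, q) <= math.sqrt(kl_divergence(p, q) / 2) + 1e-12
    for p, q in pairs
)
```

In particular, whenever the total variation distance exceeds $\bar\epsilon$, the Kullback-Leibler distance is at least $2\bar\epsilon^2$, which is how a bound on the average Kullback-Leibler distance controls the fraction of inaccurate times.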

Now, let $F_t$ be the event that the current context at time $t$, $(X^t_{\tau_{c(t)}}, A^{t-1}_{\tau_{c(t)}})$, has never been visited in the past. Observe that, by Lemma 5,



Proof: Define, for T > 0, a process {Mt} adapted to 



t=i 



C2T 
logT- 



(25) 
Putting together (24) and (25) with the definition of the event $I_t^{\bar\epsilon}$,



T , T 



— Va 



rp / , Xj — 2^ 

t=K t=\ 



Then, 



T^'^ ^' - 2e2 logT ■ 2e2 ' logT 



2e2T^ logT 



^ i^i loglogT , e' , Cz 



log log r 

logT 

By Lemma |6] and Lemma [T] we have 
log log T A 



logT 

< exp 



Te'- 



log^((2T+|X|)/p„ 



This yields the desired result by defining the constant K\ 



